From maps of slums in several Indian cities, we extracted the area of each slum and calculated a Land Shape Index (LSI), a measure of how suitable the shape of a land parcel is for construction. For many slums we also have the count of households, from which household density can be calculated. In addition to this base data, we use a set of variables related to construction cost, sale price or rent, and other real estate information. Using these, we simulate a large number of hypothetical slums in each city and calculate financial feasibility metrics for each. This data is not intended to be a representative sample of all slums, but it captures much of the diversity in slum characteristics and in the state of the real estate market. As a result, the simulation gives us a broad range of possible scenarios across the cities, with different kinds of financial outcomes if redevelopment is attempted. This leads us to an understanding of the factors that drive financial feasibility.
The slum characteristics data contains slum area, LSI and slum size (number of households) for the cities Bangalore, Belgaum, Bellary, Bhopal, Davanagere, Delhi, Gulbarga, Hubli-Dharwad, Hyderabad, Mangalore, Mumbai, Shimoga, and Tumkur. Slum density can then be calculated by dividing the size by the area. However, not all information is available for all the cities.
The plot below shows the fractions of missing data for different cities for the key variables slum LSI, area and size.
There are many records with only the size available, with both the LSI and the area missing. Here we look at the distribution of sizes for these two categories (other variables missing versus not missing).
There is a good overlap between the distributions for the missing and not missing categories.
For some cities, size is completely missing. For Delhi, we have size for a small number of records. Here we look at how the land shape and the size variables are distributed for the Delhi slums with and without the size variable.
For multiple cities, slum size information is missing. We use a missing data imputation method to fill in these values. To check whether the data is missing at random, we compare the distribution of the other variables for the part of the data where slum size is missing with the part where it is available. The plot below shows a good degree of overlap between the two, although they are not perfectly aligned.
We use the nonparametric missing value imputation algorithm named missForest (Stekhoven and Bühlmann 2011). This is known to be a very effective imputation method that requires very little fine-tuning.
To empirically validate missForest, we take the subset of the data with no missing values and randomly delete 20% of the slum sizes from it. We then measure how accurately the algorithm is able to recover the deleted values.
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
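This validation step can be sketched as follows. The report itself uses the missForest R package; the sketch below is an illustrative Python analogue using scikit-learn's `IterativeImputer` with random-forest regressors, and all data and variable names in it are invented for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 300
area = rng.uniform(0.5, 10.0, n)                  # invented slum areas
lsi = rng.uniform(0.01, 0.2, n)                   # invented land shape index
size = np.round(area * rng.uniform(100, 300, n))  # households, loosely tied to area

X = np.column_stack([area, lsi, size])
mask = rng.random(n) < 0.2                        # randomly delete ~20% of sizes
X_missing = X.copy()
X_missing[mask, 2] = np.nan

# Iterative imputation with random-forest regressors, in the spirit of missForest
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=3, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Recovery error on the deliberately deleted values
mae = mean_absolute_error(X[mask, 2], X_imputed[mask, 2])
print(f"MAE of imputed slum sizes: {mae:.1f}")
```

The mean absolute error on the artificially deleted entries plays the same role as the validation error reported in the text.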
From the scatter plot, we see that for a majority of the slums, the imputed value of size is quite close to the actual value. The mean absolute error of imputation is around 175. The heatmap tells us that the majority of the points are close to the 45-degree line. Considering this to be sufficiently accurate for the purpose of this study, we proceed to apply the imputation to the entire data set.
Another limitation in the data is that some of the smaller cities have only a small number of slums captured in the data set. In some cases, this reflects the ground reality that these cities have far fewer slums compared to the large urban centres. However, for this simulation study, it is helpful to have more examples of slums for the smaller cities as well, so that we get a clearer picture of the impact of slum characteristics on the financial outcome of redevelopment. For instance, when we analyse the data for Mangalore, we do not expect to gain insights for Mangalore alone, but for all other cities that may have slums with the same characteristics as those in Mangalore.
To handle this, we generate more “synthetic” data points for the smaller cities that are similar to the actual data for those cities, but with small variations. We use a popular oversampling method called Synthetic Minority Over-sampling TEchnique - SMOTE (Chawla et al. 2002).
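The core SMOTE idea, generating a new point on the line segment between a real sample and one of its nearest neighbours, can be sketched as follows. This is a minimal self-contained illustration, not the implementation used in the report, and the data in the example is invented.

```python
import numpy as np

def smote_oversample(X, n_new, k=5, rng=None):
    """Generate n_new synthetic points by interpolating each sampled real
    point toward one of its k nearest neighbours (SMOTE, Chawla et al. 2002)."""
    rng = rng or np.random.default_rng(0)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)               # exclude self-matches
    nn = np.argsort(dist, axis=1)[:, :k]         # k nearest neighbours per point
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X))                 # pick a real sample
        neighbour = X[rng.choice(nn[j])]         # and one of its neighbours
        gap = rng.random()                       # random point on the segment
        synthetic[i] = X[j] + gap * (neighbour - X[j])
    return synthetic

# e.g. 100 synthetic slums from 30 real ones (three invented features)
X = np.random.default_rng(1).random((30, 3))
synthetic = smote_oversample(X, 100)
print(synthetic.shape)  # (100, 3)
```

Because every synthetic point lies between two real points, the generated data stays within the range of the observed feature values, which is why the distributions remain stable after oversampling.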
Through the stages of data preparation, we fill in many missing values and generate new data for the cities that had a smaller number of samples. After these steps, we visually inspect how much the distributions of the different variables were affected. We expect the distributions to stay more or less stable through the stages of data preparation.
We simulate 5000 data points for each city. Each of these simulated points represents a unique scenario of in-situ redevelopment. If all the households cannot be accommodated in the total buildable area, we mark the project as infeasible upfront. If redevelopment is feasible, we go ahead and calculate the financial feasibility metric. In the case of sale of premium housing, the metric is the promoter's internal rate of return (IRR); for renting of commercial buildings, it is the net present value (NPV).
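The two feasibility metrics can be illustrated with a short sketch. NPV discounts a cash-flow series at a given rate, and IRR is the rate at which the NPV is zero; here IRR is found by bisection. The cash-flow numbers are invented for illustration and do not come from the report's actual financial model.

```python
import numpy as np

def npv(rate, cashflows):
    """Net present value of a cash-flow series; cashflows[0] is at t = 0."""
    t = np.arange(len(cashflows))
    return float(np.sum(np.asarray(cashflows) / (1 + rate) ** t))

def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-6):
    """Internal rate of return: the discount rate at which NPV = 0,
    found by bisection on the interval [lo, hi]."""
    if npv(lo, cashflows) * npv(hi, cashflows) > 0:
        return float("nan")          # no sign change: IRR not bracketed
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cashflows) * npv(lo, cashflows) <= 0:
            hi = mid                 # root lies in the lower half
        else:
            lo = mid
    return (lo + hi) / 2

# Hypothetical project: upfront construction cost, then staged sale proceeds
flows = [-100.0, 30.0, 40.0, 60.0]
print(f"IRR = {irr(flows):.4f}, NPV at 8% = {npv(0.08, flows):.2f}")
```

In the simulation, a scenario counts as profitable when the IRR exceeds the promoter's required rate of return (for sale of premium housing) or when the NPV is positive (for commercial rental).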
This plot shows the range of values each of the variables take after the simulation, for each city.
The plots below show the distributions for the two categories of cities:
This plot shows the distribution of the variables by the financial feasibility and profitability.
We also look at the combined plots for the two categories of cities:
When we consider multiple variables that influence financial feasibility, it is possible to think of ranges or regions of the values of these variables that are associated with higher chances of profitability.
We show how the profitable, infeasible and non-profitable cases are distributed for the scenario of redevelopment leading to sale of premium housing. The plot below shows the 5% trimmed intervals of multiple variables for the three situations.
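A trimmed interval of this kind can be computed directly from percentiles. The sketch below assumes "5% trimmed" means trimming 5% from each tail, i.e. the 5th to 95th percentile range; the data in the example is invented.

```python
import numpy as np

def trimmed_interval(x, trim=0.05):
    """Interval spanning the data after trimming `trim` fraction from each
    tail -- with the default, the 5th to 95th percentiles."""
    lo, hi = np.percentile(x, [100 * trim, 100 * (1 - trim)])
    return float(lo), float(hi)

# e.g. for a simulated slum density column (values invented)
density = np.random.default_rng(0).uniform(50, 400, 1000)
print(trimmed_interval(density))
```

Computing such an interval separately for the profitable, non-profitable and infeasible groups gives the per-variable bands plotted above.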
Now we look at the patterns taking pairs of variables.
The plot below shows the non-profitable, infeasible and profitable cases assuming rental of commercial buildings, with the net present value (NPV) used as the measure of profitability.
To assess the impact of different variables on the financial feasibility, we fit a decision tree model to the data and visualise the decision points in the tree. A sample of size 500 from each of the three outcomes are used to fit the decision tree. A single decision tree cannot capture all the intricacies in the data, but it gives us an approximate picture of the major patterns. The terminal nodes of the tree are integrated with a heatmap indicating the range of the values of the variables at different regions of financial feasibility. This is done using treeheatr, an R package for interpretable decision tree visualizations (Le and Moore 2020). For fitting the decision tree, we use the R package rpart (Therneau, Atkinson, and Ripley 2019).
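The report fits the tree with rpart and visualises it with treeheatr in R; the sketch below is an illustrative Python analogue of the same procedure (balanced sample of 500 per outcome, shallow tree, printed rules) using scikit-learn, on invented data with hypothetical feature names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 3000
# Invented simulated features: land shape index, density, floor area ratio
X = np.column_stack([rng.uniform(0.01, 0.2, n),
                     rng.uniform(50, 400, n),
                     rng.uniform(1.0, 3.0, n)])
# A toy three-class outcome, standing in for the simulated financial outcome
y = np.where(X[:, 0] < 0.05, "profitable",
             np.where(X[:, 1] > 300, "infeasible", "not profitable"))

# Balanced sample: 500 points per outcome class, as in the text
idx = np.concatenate([rng.choice(np.flatnonzero(y == c), 500, replace=True)
                      for c in np.unique(y)])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[idx], y[idx])
print(export_text(tree, feature_names=["slum_land_shape", "slum_density",
                                       "floor_area_ratio"]))
```

A shallow depth limit keeps the tree interpretable, at the cost of missing finer interactions, which matches the caveat in the text that a single tree gives only an approximate picture.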
An alternative to the decision trees for explaining the pattern involving multiple covariates and an outcome would be a rule learning algorithm. Below we show the results of running the RIPPER (repeated incremental pruning to produce error reduction) algorithm on the data for large cities.
For more details on the method, see: https://christophm.github.io/interpretable-ml-book/rules.html
##
## === Summary ===
##
## Correctly Classified Instances 898 89.8 %
## Incorrectly Classified Instances 102 10.2 %
## Kappa statistic 0.7483
## Mean absolute error 0.0944
## Root mean squared error 0.2172
## Relative absolute error 38.7049 %
## Root relative squared error 62.29 %
## Total Number of Instances 1000
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 708 65 8 | a = infeasible
## 8 116 11 | b = not profitable
## 3 7 74 | c = profitable
## JRIP rules:
## ===========
##
## (slum_land_shape <= 0.04871) and (sale_price_rights >= 24324.853229) => fin_outcome=profitable (43.0/15.0)
## (slum_density <= 181.64794) and (sale_price_building >= 98189.823875) and (slum_land_shape <= 0.095918) => fin_outcome=profitable (21.0/0.0)
## (slum_land_shape <= 0.053659) and (slum_density <= 291.745692) and (sale_price_building >= 70743.639922) and (floor_area_ratio >= 2.202073) and (slum_size <= 532) => fin_outcome=profitable (14.0/2.0)
## (slum_density <= 175.161478) and (constr_cost_prem_housing <= 43972.60274) and (floor_area_ratio >= 2.412367) => fin_outcome=profitable (10.0/2.0)
## (slum_land_shape <= 0.040437) and (sale_price_building >= 153082.191781) and (slum_density <= 310.463995) and (floor_area_ratio >= 2.358254) => fin_outcome=profitable (5.0/0.0)
## (slum_land_shape <= 0.050586) => fin_outcome=not profitable (156.0/65.0)
## (floor_area_ratio >= 2.539697) and (slum_density <= 210.727969) => fin_outcome=not profitable (17.0/5.0)
## (slum_land_shape <= 0.060522) and (slum_density <= 329.946683) and (floor_area_ratio >= 2.051092) and (slum_land_shape >= 0.051549) => fin_outcome=not profitable (15.0/2.0)
## => fin_outcome=infeasible (719.0/11.0)
##
## Number of Rules : 9
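The JRIP rule list above is an ordered decision list: rules are checked top to bottom, the first matching rule assigns the class, and "infeasible" is the default. Translated literally into a Python predicate (thresholds copied from the output; the example record is invented):

```python
def fin_outcome(r):
    """Apply the JRIP rules in order; the first match wins, default 'infeasible'."""
    if r["slum_land_shape"] <= 0.04871 and r["sale_price_rights"] >= 24324.853229:
        return "profitable"
    if (r["slum_density"] <= 181.64794 and r["sale_price_building"] >= 98189.823875
            and r["slum_land_shape"] <= 0.095918):
        return "profitable"
    if (r["slum_land_shape"] <= 0.053659 and r["slum_density"] <= 291.745692
            and r["sale_price_building"] >= 70743.639922
            and r["floor_area_ratio"] >= 2.202073 and r["slum_size"] <= 532):
        return "profitable"
    if (r["slum_density"] <= 175.161478
            and r["constr_cost_prem_housing"] <= 43972.60274
            and r["floor_area_ratio"] >= 2.412367):
        return "profitable"
    if (r["slum_land_shape"] <= 0.040437
            and r["sale_price_building"] >= 153082.191781
            and r["slum_density"] <= 310.463995
            and r["floor_area_ratio"] >= 2.358254):
        return "profitable"
    if r["slum_land_shape"] <= 0.050586:
        return "not profitable"
    if r["floor_area_ratio"] >= 2.539697 and r["slum_density"] <= 210.727969:
        return "not profitable"
    if (0.051549 <= r["slum_land_shape"] <= 0.060522
            and r["slum_density"] <= 329.946683
            and r["floor_area_ratio"] >= 2.051092):
        return "not profitable"
    return "infeasible"

# An invented record: compact land shape and high rights price trigger rule 1
example = {"slum_land_shape": 0.03, "slum_density": 200.0, "slum_size": 600,
           "sale_price_rights": 30000.0, "sale_price_building": 50000.0,
           "constr_cost_prem_housing": 60000.0, "floor_area_ratio": 2.0}
print(fin_outcome(example))  # profitable
```

Reading the rules this way makes the main pattern explicit: a low land shape index (a compact parcel) combined with high sale prices drives profitability, while a high land shape index almost always falls through to the default infeasible class.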
We look at how sensitive profitability is to changes in subsidies and FAR by calculating the fraction of profitable projects in the simulation while holding these two variables fixed at different levels.
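The shape of this sensitivity analysis can be sketched as follows. The profit rule here is a toy stand-in, not the report's financial model, and all numbers are invented; the point is the structure of the computation, sweeping a grid of fixed subsidy and FAR levels over the simulated projects.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Invented per-square-metre construction costs and sale prices
constr_cost = rng.uniform(50.0, 150.0, n)
sale_price = rng.uniform(80.0, 200.0, n)

def frac_profitable(subsidy, far):
    """Fraction of simulated projects with a positive margin when the
    subsidy and the floor area ratio (FAR) are held fixed (toy rule)."""
    profit = (sale_price - constr_cost) * far + subsidy
    return float(np.mean(profit > 0))

# Sweep a grid of subsidy and FAR levels over the same simulated projects
for subsidy in (0.0, 25.0, 50.0):
    row = [round(frac_profitable(subsidy, far), 3) for far in (1.5, 2.0, 2.5)]
    print(subsidy, row)
```

Each cell of the resulting grid is the share of profitable scenarios at that subsidy/FAR combination, which is what the sensitivity plot summarises.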
This work is licensed under CC BY 4.0. Copyright: Swastik Harish and Sooraj Raveendran, 2021.
This report was prepared in R Markdown. The source code and the data are available at https://github.com/soorajmr/issr-simulation. The code is released under the MIT licence.
Report generated at: 01 October, 2021 03:41 PM